159 research outputs found
Classification-Aware Hidden-Web Text Database Selection
Many valuable text databases on the web have noncrawlable contents that are “hidden” behind
search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web”
text databases at once through a unified query interface. An important step in the metasearching
process is database selection, or determining which databases are the most relevant for a given
user query. The state-of-the-art database selection techniques rely on statistical summaries of the
database contents, generally including the database vocabulary and associated word frequencies.
Unfortunately, hidden-web text databases typically do not export such summaries, so previous research
has developed algorithms for constructing approximate content summaries from document
samples extracted from the databases via querying. We present a novel “focused-probing” sampling
algorithm that detects the topics covered in a database and adaptively extracts documents that
are representative of the topic coverage of the database. Our algorithm is the first to construct
content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s
law practically guarantees that for any relatively large database, content summaries built from
moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete
content summaries might negatively affect the database selection process, especially for short
queries with infrequent words. To enhance the sparse document samples and improve the database
selection decisions, we exploit the fact that topically similar databases tend to have similar
vocabularies, so samples extracted from databases with a similar topical focus can complement
each other. We have developed two database selection algorithms that exploit this observation.
The first algorithm proceeds hierarchically and selects the best categories for a query, and then
sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data,
to enhance the database content summaries with category-specific words. We describe how to modify
existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is
beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases
as well as TREC data, suggests that the proposed sampling methods generate high-quality
content summaries and that the database selection algorithms produce significantly more relevant
database selection decisions and overall search results than existing algorithms.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
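The shrinkage idea in the abstract above can be illustrated with a minimal sketch. It assumes each content summary is a dictionary mapping a word to its estimated probability and that a single fixed mixing weight `lam` is used; the abstract does not say how the weights are set, so `shrink_summary`, `lam`, and the toy numbers are all hypothetical.

```python
def shrink_summary(db_probs, category_probs, lam=0.7):
    """Blend a sampled database summary with its category summary.

    lam is the weight kept on the database's own sample. A fixed value is an
    assumption made for illustration only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    vocabulary = set(db_probs) | set(category_probs)
    return {
        word: lam * db_probs.get(word, 0.0)
        + (1.0 - lam) * category_probs.get(word, 0.0)
        for word in vocabulary
    }


# Toy data: the sample missed the low-frequency word "metastasis".
db_sample = {"cancer": 0.5, "therapy": 0.5}
category = {"cancer": 0.4, "therapy": 0.4, "metastasis": 0.2}
shrunk = shrink_summary(db_sample, category, lam=0.75)
```

In this toy case the word missing from the sample receives a small nonzero probability from the category summary, so a short query containing it would no longer score the database at zero.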
Extracting Relations from Large Plain-Text Collections
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. We explore a technique for extracting such tables from document collections that requires only a handful of training examples from users. These examples are used to generate extraction patterns, that in turn result in new tuples being extracted from the document collection. We build on this idea and present our Snowball system. Snowball introduces novel strategies for generating patterns and extracting tuples from plain-text documents. At each iteration of the extraction process, Snowball evaluates the quality of these patterns and tuples without human intervention. In this paper we also develop a scalable evaluation methodology and metrics for our task, and present a thorough experimental evaluation of Snowball and comparable techniques over a collection of more than 300,000 newspaper documents.
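The seed-to-pattern-to-tuple loop described above can be sketched roughly as follows. This is not Snowball itself: it assumes sentences are already reduced to hypothetical `(organization, middle_text, location)` triples, treats the exact middle text as the pattern, and uses a simple support count in place of the quality evaluation the abstract mentions. All names and data are placeholders.

```python
from collections import Counter


def bootstrap(sentences, seeds, iterations=3, min_support=1):
    """Grow a set of (organization, location) tuples from a few seeds."""
    known = set(seeds)
    for _ in range(iterations):
        # Patterns: middle texts that connect at least min_support known tuples.
        support = Counter(mid for org, mid, loc in sentences if (org, loc) in known)
        patterns = {mid for mid, count in support.items() if count >= min_support}
        # New tuples: any pair joined by one of the accepted patterns.
        found = {(org, loc) for org, mid, loc in sentences if mid in patterns}
        if found <= known:
            break  # nothing new was extracted, so stop early
        known |= found
    return known


# Placeholder sentences; no real organizations or places are implied.
sentences = [
    ("OrgA", "is headquartered in", "CityA"),
    ("OrgB", "is headquartered in", "CityB"),
    ("OrgB", ", based in", "CityB"),
    ("OrgC", ", based in", "CityC"),
    ("OrgD", "sued", "CityD"),
]
tuples = bootstrap(sentences, seeds={("OrgA", "CityA")})
```

Starting from one seed, the first pass learns one pattern and finds a second tuple, which in turn exposes a second pattern on the next pass. A real system would need to score patterns and tuples to keep errors from compounding.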
Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes
Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. The content summaries that result from this algorithm are efficient to derive and more accurate than those from previously proposed probing techniques for content-summary extraction. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to produce accurate results even for imperfect content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases.
Computing Geographical Scopes of Web Resources
Many information resources on the web are relevant primarily to limited geographical communities. For instance, web sites containing information on restaurants, theaters, and apartment rentals are relevant primarily to web users in geographical proximity to these locations. In contrast, other information resources are relevant to a broader geographical community. For instance, an on-line newspaper may be relevant to users across the United States. Unfortunately, most current web search engines largely ignore the geographical scope of web resources. In this paper, we introduce techniques for automatically computing the geographical scope of web resources, based on the textual content of the resources, as well as on the geographical distribution of hyperlinks to them. We report an extensive experimental evaluation of our strategies using real web data. Finally, we describe a geographically-aware search engine that we have built using our techniques for determining the geographical scope of web resources
Combining Strategies for Extracting Relations from Text Collections
Text documents often contain valuable structured data that is hidden in regular English sentences. This data is best exploited if available as a relational table that we could use for answering precise queries or for running data mining tasks. Our Snowball system extracts these relations from document collections starting with only a handful of user-provided example tuples. Based on these tuples, Snowball generates patterns that are used, in turn, to find more tuples. In this paper we introduce a new pattern and tuple generation scheme for Snowball, with different strengths and weaknesses than those of our original system. We also show preliminary results on how we can combine the two versions of Snowball to extract tuples more accurately
Beyond Trending Topics: Real-World Event Identification on Twitter
User-contributed messages on social media sites such as Twitter have emerged as powerful, real-time means of information sharing on the Web. These short messages tend to reflect a variety of events in real time, earlier than other social media sites such as Flickr or YouTube, making Twitter particularly well suited as a source of real-time event content. In this paper, we explore approaches for analyzing the stream of Twitter messages to distinguish between messages about real-world events and non-event messages. Our approach relies on a rich family of aggregate statistics of topically similar message clusters, including temporal, social, topical, and Twitter-centric features. Our large-scale experiments over millions of Twitter messages show the effectiveness of our approach for surfacing real-world event content on Twitter
QProber: A System for Automatic Classification of Hidden-Web Resources
The contents of many valuable web-accessible databases are only available through search interfaces and are hence invisible to traditional web "crawlers." Recently, commercial web sites have started to manually organize web-accessible databases into Yahoo!-like hierarchical classification schemes. Here, we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases
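The idea of classifying a database from match counts alone can be sketched as below. This is an illustration under assumptions, not QProber's actual decision rule: `count_matches` is a hypothetical callable standing in for the database's search interface, the probe words are made up, and the `min_share` threshold is an arbitrary choice.

```python
def classify_by_probes(count_matches, probes_by_category, min_share=0.4):
    """Assign categories using only the number of matches each probe reports.

    No documents are retrieved; count_matches(query) returns a match count.
    """
    totals = {
        category: sum(count_matches(query) for query in probes)
        for category, probes in probes_by_category.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # the probes matched nothing, so no category is assigned
    return sorted(
        category
        for category, matches in totals.items()
        if matches / grand_total >= min_share
    )


# Toy stand-in for a search interface: fixed, made-up match counts.
fake_counts = {"cancer": 900, "diabetes": 600, "soccer": 20, "tennis": 30}
probes = {"Health": ["cancer", "diabetes"], "Sports": ["soccer", "tennis"]}
labels = classify_by_probes(lambda query: fake_counts.get(query, 0), probes)
```

Because only counts are consulted, the cost per database is one query per probe, which is consistent with the low overhead the abstract reports.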
Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the Web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for
processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can
exploit search engine indexes and retrieve the documents of interest via carefully crafted queries
constructed in task-specific ways. The choice between crawl- and query-based execution plans can
have a substantial impact on both execution time and output “completeness” (e.g., in terms of
recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition.
In this article, we present fundamental building blocks to make the choice of execution plans
for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze
query- and crawl-based plans in terms of both execution time and output completeness. We adapt
results from random-graph theory and statistics to develop a rigorous cost model for the execution
plans. Our cost model reflects the fact that the performance of the plans depends on fundamental
task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two
optimization approaches for text-centric tasks that rely on the cost-model parameters and select
efficient execution plans. Overall, our optimization approaches help build efficient execution plans
for a task, resulting in significant efficiency and output completeness benefits. We complement our
results with a large-scale experimental evaluation for three important text-centric tasks and over
multiple real-life data sets.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc
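A cost-based choice between crawl- and query-based plans might look like the following minimal sketch. It assumes the time and recall estimates are already available as inputs and does not reproduce the article's cost model; `PlanEstimate`, `choose_plan`, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    name: str
    seconds: float  # estimated execution time
    recall: float   # estimated output completeness, between 0 and 1


def choose_plan(estimates, target_recall):
    """Pick the fastest plan whose estimated recall meets the target.

    If no plan reaches the target, fall back to the most complete one.
    """
    feasible = [plan for plan in estimates if plan.recall >= target_recall]
    if not feasible:
        return max(estimates, key=lambda plan: plan.recall)
    return min(feasible, key=lambda plan: plan.seconds)


# Made-up estimates: crawling is complete but slow, querying is fast but partial.
estimates = [
    PlanEstimate("crawl", seconds=3600.0, recall=1.0),
    PlanEstimate("query", seconds=400.0, recall=0.7),
]
relaxed = choose_plan(estimates, target_recall=0.6)
strict = choose_plan(estimates, target_recall=0.9)
```

With these toy numbers a modest recall target favors the query-based plan, while a strict target forces the crawl, which reflects the trade-off between execution time and completeness described above.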